Abstract


We introduce language-driven image generation, the task of generating an image
visualizing the semantic contents of a word embedding, e.g., given the word
embedding of grasshopper, we generate a natural image of a grasshopper. We 
implement a simple method based on two mapping functions. The first takes as 
input a word embedding (as produced, e.g., by the word2vec toolkit) and maps it 
onto a high-level visual space (e.g., the space defined by one of the top layers of 
a Convolutional Neural Network). The second function maps this abstract visual 
representation to pixel space, in order to generate the target image. Several user 
studies suggest that the current system produces images that capture general visual
properties of the concepts encoded in the word embedding, such as color or
typical environment, and are sufficient to discriminate between general categories 
of objects. 


1 Introduction 

Imagination, creating new images in the mind, is a fundamental capability of humans, studies of 
which date back to Plato’s ideas about memory and perception. Through imagery, we form mental 
images, picture-like representations in our mind, that encode and extend our perceptual and linguistic 
experience of the world. Recent work in neuroscience attempts to generate reconstructions of these 
mental images, as encoded in vector-based representations of fMRI patterns. In this work, we
take the first steps towards implementing the same paradigm in a computational setup, by generating 
images that reflect the imagery of distributed word representations. 

We introduce language-driven image generation, the task of visualizing the contents of a linguistic
message, as encoded in word embeddings, by generating a real image. Language-driven image
generation can serve as an evaluation tool, providing an intuitive visualization of what computational
representations of word meaning encode. More ambitiously, effective language-driven image generation
could complement image search and retrieval, producing images for words that are not associated
with images in a certain collection, either because of sparsity or because of their inherent properties (e.g., artists
and psychologists might be interested in images of abstract or novel words). In this work, we focus
on generating images for distributed representations encoding the meaning of single words. However,
given recent advances in compositional distributed semantics that produce embeddings

* Research carried out in Center for Mind/Brain Sciences, University of Trento 
Figure 1: Generated images of 10 concepts per category for 20 basic categories, grouped by macro-category.
See supplementary materials for the answer key.


for arbitrarily long linguistic units, we also see our contribution as the first step towards generating 
images depicting the meaning of phrases (e.g., blue car) and sentences. After all, language-driven 
image generation can be seen as the symmetric goal of recent research that introduced
effective methods to generate linguistic descriptions of the contents of a given image. 

To perform language-driven image generation, we combine various recent strands of research. Tools
such as word2vec and GloVe have been shown to produce extremely high-quality vector-based
word embeddings. At the same time, in computer vision, images are effectively represented
by vectors of abstract visual features, such as those extracted by Convolutional Neural Networks
(CNNs). Consequently, the problem of translating between linguistic and visual representations
has been couched in terms of learning a cross-modal mapping function between vector spaces.
Finally, recent work in computer vision, motivated by the desire to achieve a better understanding
of what the layers of CNNs and other deep architectures have really learned, has proposed feature
inversion techniques that map a representation in abstract visual feature space (e.g., from the top
layer of a CNN) back onto pixel space, to produce a real image [24, 10].

Our language-driven image generation system takes a word embedding as input (e.g., the word2vec
vector for grasshopper), projects it with a cross-modal function onto visual space (e.g., onto a
representation in the space defined by a CNN layer), and then applies feature inversion to it (using the
HOGgles method) to generate an actual image (cell A18 in Figure 1). We test our
system in a rigorous zero-shot setup, in which words and images of tested concepts are neither used
to train cross-modal mapping, nor employed to induce the feature inversion function. So, for example,
our system mapped grasshopper onto visual and then pixel space without having ever been
exposed to grasshopper pictures.

Figure 1 illustrates our results (“answer key” for the figure provided as supplementary material).
While it is difficult to discriminate among similar objects based on these images, the figure shows
that our language-driven image generation method already captures the broad gist of different domains
(food looks like food, animals are blobs in a natural environment, and so on).
2 Language-driven image generation


2.1 From word to visual vectors 

Up to now, feature inversion algorithms [10, 22, 24] have been applied to visual representations
directly extracted from images (hence the “inversion” name). We aim instead at generating an image
conveying the semantics of a concept as encoded in a word representation. Thus, we need a way to
“translate” the word representation into a visual representation, i.e., a representation lying in the
visual space that conveys the corresponding visual semantics of the word.

Cross-modal mapping was first introduced in the context of zero-shot learning as a way to
address the manual annotation bottleneck in domains where other vector-based representations (e.g.,
images or brain signals) must be associated with word labels [13, 19]. This is achieved by using training
data to learn a mapping function from vectors in the domain of interest to vector representations of
word labels. In our case, we are interested in the general ability of cross-modal mapping to translate
a representation between different spaces, and specifically from a word to a visual feature space.

The mapping is performed by inducing a function $f : \mathbb{R}^{d_1} \rightarrow \mathbb{R}^{d_2}$ from data points $(w_i, v_i)$, where
$w_i \in \mathbb{R}^{d_1}$ is a word representation and $v_i \in \mathbb{R}^{d_2}$ the corresponding visual representation. The
mapping function can then be applied to any given word vector $w_j$ to obtain its projection $\hat{v}_j = f(w_j)$
onto visual space. Following previous work, we assume that the mapping is linear.
To estimate its parameters $M \in \mathbb{R}^{d_1 \times d_2}$, given word vectors $W$ paired with visual vectors $V$, we
use Elastic-Net-penalized least squares regression, which linearly combines the L1 and L2 weight
penalties of Lasso and Ridge regularization:

$$\hat{M} = \underset{M \in \mathbb{R}^{d_1 \times d_2}}{\arg\min} \; \|WM - V\|_F + \lambda_1 \|M\|_1 + \lambda_2 \|M\|_F \qquad (1)$$


By modifying the weights of the L1 and L2 penalties, $\lambda_1$ and $\lambda_2$, we can derive different regression
methods. Specifically, we experiment with plain regression ($\lambda_1 = 0$, $\lambda_2 = 0$), ridge regression
($\lambda_1 = 0$, $\lambda_2 \neq 0$), lasso regression ($\lambda_1 \neq 0$ and $\lambda_2 = 0$) and symmetric elastic net ($\lambda_1 = \lambda_2$, $\lambda_1 \neq 0$).
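
In the plain-regression setting (both penalty weights set to zero), estimating the mapping reduces to ordinary least squares. The following is a minimal sketch in dependency-free Python, not the implementation used in this work: the helper names (`transpose`, `matmul`, `solve`, `fit_linear_map`, `project`) and the toy data are illustrative assumptions, and the normal equations are solved directly, which is only sensible for small, well-conditioned problems.

```python
from typing import List

Matrix = List[List[float]]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def solve(a: Matrix, b: Matrix) -> Matrix:
    """Solve a @ x = b for square a by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("W^T W is singular; plain regression is not identifiable")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_linear_map(W: Matrix, V: Matrix) -> Matrix:
    """Least-squares M minimizing ||WM - V||_F, via the normal equations."""
    Wt = transpose(W)
    return solve(matmul(Wt, W), matmul(Wt, V))


def project(w: List[float], M: Matrix) -> List[float]:
    """Map a single word vector w (length d1) onto visual space (length d2)."""
    return matmul([w], M)[0]


# Toy data (assumed): 3 "seen" concepts, d1 = 2 word dims, d2 = 3 visual dims.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [5.0, 7.0, 9.0]]
M = fit_linear_map(W, V)
v_hat = project([2.0, 1.0], M)  # projection of an unseen ("dreamed") word vector
```

With realistic dimensionalities one would rely on a numerical linear algebra library instead of hand-written elimination; the ridge, lasso and elastic-net variants additionally require the penalty terms and are not covered by this sketch.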

2.2 From visual vectors to images 

Convolutional Neural Networks have recently surpassed human performance on object recognition.
Nevertheless, these models exhibit “intriguing properties” that are somewhat surprising
given their state-of-the-art performance [21], prompting an effort to reach a deeper understanding of
how they really work. Given that these models consist of millions of parameters, there is ongoing 
research on feature inversion of different CNN layers to attain an intuitive visualization of what each 
of them learned. 

Several methods have been proposed for inverting CNN visual features; however, the exact nature
of the task imposes certain constraints on the inversion method. For example, the original work of
Zeiler and Fergus [24] cannot be straightforwardly adapted to our task of generating images from
word embeddings, since their DeConvNet method requires information related to the activations of
the network in several layers. In this work, we adopt the framework of Vondrick et al. [22], which casts
the problem of inversion as paired dictionary learning.¹

Specifically, given an image $x_0 \in \mathbb{R}^D$ and its visual representation $y = \phi(x_0) \in \mathbb{R}^d$, the goal is to
find an image $x^*$ that minimizes the reconstruction error:

$$x^* = \underset{x \in \mathbb{R}^D}{\arg\min} \; \|\phi(x) - y\|_2^2 \qquad (2)$$

Given that there are no guarantees regarding the convexity of $\phi$, both images and visual representations
are approximated by paired, over-complete bases, $U \in \mathbb{R}^{D \times K}$ and $V \in \mathbb{R}^{d \times K}$ respectively.
Enforcing $U$ and $V$ to have paired representations through shared coefficients $\alpha \in \mathbb{R}^K$, i.e.,
$x_0 = U\alpha$ and $y = V\alpha$, allows the feature inversion to be done by estimating the coefficients $\alpha$
that minimize the reconstruction error. Practically, the algorithm proceeds by finding $U$, $V$ and $\alpha$

¹ Originally, the HOGgles method of [22] was introduced for visualizing HOG features. However, the
method does not make feature-specific assumptions and it has also recently been used to invert CNN features.
through a standard sparse coding method. For learning the parameters, the algorithm is presented
with training data of the form $(x_i, y_i)$, where $x_i$ is an image patch and $y_i$ the corresponding visual
vector associated with that patch.
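
Once the paired bases have been learned, inverting a new visual vector $y$ can be written as a two-step procedure over the shared coefficients. The formulation below is a sketch under stated assumptions: the $\ell_1$ sparsity penalty and its weight $\lambda$ stand in for the unspecified sparse coding method, $K$ denotes the number of dictionary atoms, and $\hat{x}$ denotes the reconstructed image.

```latex
\alpha^* = \underset{\alpha \in \mathbb{R}^K}{\arg\min} \; \|V\alpha - y\|_2^2 + \lambda \|\alpha\|_1,
\qquad
\hat{x} = U\alpha^*
```

The first step involves only the visual dictionary $V$ and the feature vector $y$, so it can be applied to any vector in visual space, including one produced by cross-modal mapping rather than extracted from an image.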

3 Experimental Setup 

3.1 Materials 

Dreamed Concepts We refer to the words we generate images for as dreamed concepts. The 
dreamed word set comes from the concepts studied by McRae et al. [11] in the context of property
norm generation. This set contains 541 base-level concrete concepts (e.g., cat, apple, car, etc.) that
span 20 general and broad categories (e.g., animal, fruit/vegetable, vehicle, etc.). For the
purposes of the current experiments, 69 McRae concepts were excluded (either because of high 
ambiguity or for technical reasons), resulting in 472 dreamed words we test on. 

Seen Concepts We refer to the set of words associated with real pictures that are used for training
purposes as seen concepts. The real picture set contains approximately 480K images extracted from
ImageNet representing 5K distinct concepts. The seen concepts are used for training the cross-modal
mapping. Importantly, the dreamed and seen concept sets do not overlap.

Word Representations For all seen and dreamed concepts, we build 300-dimensional word vectors
with the word2vec toolkit,² choosing the CBOW method.³ CBOW, which learns to predict a target
word from the ones surrounding it, produces state-of-the-art results in many linguistic tasks.
Word vectors are induced from a language corpus (e.g., Wikipedia) of 2.8 billion words.⁴

Visual Representations The visual representations, for the set of 480K seen concept images, are
extracted with the pre-trained CNN model of [9] through the Caffe toolkit. CNNs trained on
natural images learn a hierarchy of increasingly more abstract properties: the features in the bottom
layers resemble Gabor filters, while features in the top layers capture more abstract properties of
the dataset or tasks the CNN is trained for (see [24]); e.g., the topmost layer captures a distribution
over training labels. In this work, we experiment with feature representations extracted from two
levels: pool-5, extracted from the 5th layer (6x6x256 = 9216 dimensions), and fc-7, extracted from the
7th layer (1x4096 dimensions). pool-5 is an intermediate pooling layer that should capture object
commonalities; fc-7 is a fully-connected layer just below the topmost one, and as such it is expected
to capture high-level discriminative features of different object classes.

Since each seen concept is associated with many images, we experiment with two ways to derive
a unique visual representation. Inspired by categorization schemes in cognitive science [14],
we will refer to them as the prototype and exemplar methods. The prototype visual vector of a
concept is constructed by averaging the visual representations (either pool-5 or fc-7) of images tagged
in ImageNet with the concept. The averaging method should smooth out noise and emphasize
invariances in images associated with a concept. On the other hand, the constructed prototype does not
correspond to an actual depiction of the concept. The exemplar visual vector, by contrast, is a
single visual vector that is picked as a good representative of the set, as it is the one with the highest
average cosine similarity to all other vectors extracted from images labeled with the same concept.
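
The two aggregation schemes can be sketched as follows. This is an illustrative example in plain Python, assuming at least two non-zero vectors per concept; the function names and the toy two-dimensional vectors are hypothetical and stand in for the pool-5 or fc-7 features of a concept's images.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def prototype(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of all visual vectors of a concept."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def exemplar(vectors: List[Vector]) -> Vector:
    """Vector with the highest average cosine similarity to all the others."""
    def avg_sim(i: int) -> float:
        others = [v for j, v in enumerate(vectors) if j != i]
        return sum(cosine(vectors[i], v) for v in others) / len(others)

    best = max(range(len(vectors)), key=avg_sim)
    return vectors[best]


# Toy concept (assumed data) with three 2-dimensional "image" vectors.
images = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
proto = prototype(images)  # the mean vector, which matches none of the inputs
exem = exemplar(images)    # the middle vector, an actual member of the set
```

The toy data shows the trade-off described above: the prototype is a synthetic point, whereas the exemplar is always one of the observed vectors.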

3.2 Model selection and parameter estimation 

Visual feature type and concept representations In order to determine the optimal visual feature 
type (between pool-5 and fc-7) and concept representation method (between prototype and exemplar), 
we set up a human study through CrowdFlower.⁵ For 50 randomly chosen test concepts, we generate
4 images, each obtained by inverting the visual vector computed by combining a feature type with

² https://code.google.com/p/word2vec/

³ Other hyperparameters, adopted without tuning, include a context window size of 5 words to either side of
the target, setting the sub-sampling option to 1e-05 and estimating the probability of target words by negative
sampling, drawing 10 samples from the noise distribution.

⁴ Corpus sources: http://wacky.sslmit.unibo.it, http://www.natcorp.ox.ac.uk

⁵ http://www.crowdflower.com/
a concept representation method, e.g., for pool-5+prototype, we generate an image by inverting the
visual vector resulting from averaging the pool-5 feature vectors extracted from images labeled with
the test concept (details on our implementation of feature inversion below). Participants are then
asked to judge which of the 4 images is most likely to denote the test concept. For each test concept,
we collect 20 judgments. Overall, participants showed a strong, significant preference for the
images generated from inverting pool-5 feature vectors (28/50), and in particular for those that were
generated from pool-5 by inverting feature vectors constructed with the exemplar protocol (18/50).⁶
The following experiments were thus carried out using the pool-5+exemplar visual space.

Cross-modal mapping To learn the mapping $M$ of Equation 1, we use 5K training pairs
$(w_c, v_c)$, with $w_c \in \mathbb{R}^{300}$ and $v_c \in \mathbb{R}^{9216}$, where $w_c$ is the word vector and $v_c$ is the visual vector for the
(seen) concept $c$, based on pool-5 features and exemplar representation. Specifically, we estimate the
weights $M$ by training the 4 regression methods described in Section 2.1 above, cross-validating
the values of $\lambda_1$ and $\lambda_2$ on the training data. Model selection is performed by conducting a human
study on the language-driven image generation task. For the same test set of 50 concepts as above, we
obtain estimates of their visual vectors $\hat{v}$ by mapping their word vectors into visual space through
the different mapping functions $M$. We then generate an image by inverting the visual features $\hat{v}$.
Participants are again asked to judge which of the 4 images is most likely to denote the test concept.
For each concept we collected 20 judgments. Participants showed a preference for plain regression
(9/50 significant tests in favor of this model), which we adopt in the rest of the paper.


Feature inversion Training data for feature inversion (Section 2.2 above) are created by using
the PASCAL VOC 2011 dataset, which contains 15K images of 20 distinct objects. Note that the 20
PASCAL objects are not part of our dreamed concepts, and thus the feature inversion is performed
in a zero-shot way (the inversion will be asked to generate an image for a concept that it has never
encountered before). In order to increase the size of the training data, from each image we derived
several image patches $x_i$ associated with different parts of the image and paired them with
their equivalent visual representations $y_i$. Both paired dictionary learning and feature inversion are
conducted using the HOGgles software [22] with default hyperparameters.⁷


4 Experiments 

Figure 1 provides a snapshot of our results; we randomly picked 10 dreamed concepts from each
of the 20 McRae categories, and we show the image we generated for them from the corresponding
word embeddings, as described above. We stress again that the images of dreamed concepts
were never used in any step of the pipeline, neither to train cross-modal mapping, nor to train feature
inversion, so they are genuinely generated in a zero-shot manner, by leveraging their linguistic
associations to seen concepts.

Not surprisingly, the images we generate are not as clear as those one would get by retrieving existing 
images. However, we see in the figure that concepts belonging to different categories are clearly 
distinguished, with the exception of food and fruit/vegetable (columns 12 and 13), that look very 
much the same (on the other hand, fruit and vegetable are also food, and word vectors extracted 
from corpora will likely emphasize this “functional” role of theirs). 

We next present a series of user studies providing quantitative and qualitative insights into the information
that subjects can extract from the visual properties of the generated images.

4.1 Experiment 1: Correct word vs. random confounder 

The first experiment is a sanity check, evaluating whether the visual properties in the generated 
images are informative enough for subjects to guess the correct label against a random alternative. 

Experiment description Participants are presented with the generated image of a dreamed concept
and are asked to judge if it is more likely to denote the correct word or a confounder randomly

⁶ Throughout this paper, statistical significance is assessed with two-tailed exact binomial tests with threshold
$\alpha < 0.05$, corrected for multiple comparisons with the false discovery rate method.

⁷ https://github.com/CSAILVision/ihog
picked from the seen word set. Given that the confounder is a randomly picked item, the task is
relatively easy. However, both confounders and dreamed concepts are concrete, basic-level concepts,
so they are sometimes related just by chance. Moreover, the confounders were used to train
the mapping and inversion functions, which could have introduced a systematic bias in their favour. 
We test the 472 dreamed concepts, collecting 20 ratings for each via CrowdFlower. Word order is 
randomized both across and within trials (the same setup is used in the following experiments, with 
image order also randomized). 

Results Participants show a consistent preference for the correct word (dreamed concept) (median 
proportion of votes in favor: 75%). Preference for the correct word is significantly different from 
chance in 211/472 cases. Participants expressed a significant preference for the confounder in 10 
cases only, and in the majority of those, dreamed concepts and their confounders shared similar 
properties, e.g., cape-tabletop (both made of textile), zebra-baboon (both mammals), oak-boathouse
(existing in similar natural environments). 

The experiment confirms that our method can generally capture at least those visual properties of 
dreamed concepts that can distinguish them from visually dissimilar random items. 
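
The per-concept significance testing used in these studies (two-tailed exact binomial tests with false discovery rate correction) can be sketched as follows. This is an illustration, not the analysis code behind the reported numbers: the text does not name the specific FDR procedure, so the Benjamini-Hochberg step-up rule is an assumption, as are the function names and the vote counts.

```python
from math import comb
from typing import List


def binom_two_sided(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def benjamini_hochberg(pvalues: List[float], alpha: float = 0.05) -> List[bool]:
    """For each p-value, report whether it is significant under BH FDR control."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            max_rank = rank  # keep the largest rank that passes its threshold
    significant = [False] * m
    for idx in order[:max_rank]:
        significant[idx] = True
    return significant


# Hypothetical votes for the correct word (out of 20 judgments) for four concepts.
votes = [19, 15, 12, 18]
pvals = [binom_two_sided(k, 20) for k in votes]
flags = benjamini_hochberg(pvals)
```

In this toy run, 15 votes out of 20 fall below 0.05 before correction but do not survive the correction, which illustrates why a high median proportion of correct votes can coexist with a moderate number of individually significant concepts.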

4.2 Experiment 2: Correct image vs. image of similar concept 

The second experiment ascertains to what extent subjects can pick the right generated image for a 
dreamed concept over a closely related alternative. 

Experiment description For each dreamed concept, we pick as confounder the closest semantic
neighbor according to the subject-based conceptual distance statistics provided by McRae et
al. [11]. In 379/472 cases, the confounder belongs to the category of the dreamed concept; hence,
distinguishing the two concepts is quite challenging (e.g., mandarin vs. pumpkin). Participants were 
presented with the images generated from the dreamed concept and the confounder, and they were 
asked which of the two images is more likely to denote the dreamed concept. 

Results Results and examples are provided in Table 1. In the vast majority of cases (409/472) the
participants did not show a significant preference for either the correct image or the confounder. This
shows that the current image generation pipeline does not yet capture the fine-grained properties that
would allow within-category discrimination. Still, within the subset of 63 cases for which subjects
did express a significant preference, we observe a clear trend in favour of the correct image (41
vs. 22). Color and environment seem to be the fine-grained properties that determined many of
the subjects’ right or wrong choices within this subset. Of the 63 pairs, 14 involve concepts from
different categories, and 49 are same-category pairs. Of the former, in 11/14 the preference was for the
right image. In 2 of the 3 wrong cases, the dreamed concept vs. confounder pairs have similar color
(emerald vs. parsley, bowl vs. dish), while neither concept has a typical discriminative color in the
third case (thermometer vs. marble). Even in the challenging same-category group, 30/49 pairs
display the right preference. In particular, subjects distinguished objects that typically have different
colors (e.g., flamingo vs. partridge), or live in different environments (e.g., turtle vs. tortoise). In
the remaining 19 within-category cases in which the confounder was preferred, color seems again
to play a crucial role in the confusion (e.g., alligator vs. crocodile, asparagus vs. spinach).

We next ran a follow-up experiment to find out to what extent the lack of precision of our algorithm 
should be attributed to noise in image generation from abstract visual features, independently of the 
linguistic origin of the signal. For these purposes, we replaced the visual feature vector produced by 
cross-modal mapping with the “gold-standard” visual vector for each dreamed/confounder concept 
(e.g., instead of mapping the partridge word vector onto visual space, we generated a partridge 
image by inverting a pool-5+exemplar vector directly extracted from a set of images labeled with this 
word obtained from ImageNet). We repeated the Experiment 2 setup using the images generated in 
this way. In this case, the number of pairs for which no significant preference emerged was 75.4% 
(356/472), in 17.6% (83/472) of the cases there was a significant preference for the correct image, 
and in 7% (33/472) for the confounder. The results in this setting are better than when visual features 
are derived from word representations, but not dramatically so. Since feature inversion is an active 
area of research in computer vision, we can thus expect that the quality of language-driven image 
generation will greatly improve simply in virtue of general advances in image generation methods. 
                     In favor of dreamed concept    In favor of confounder
                     8.6% (41/472)                  4.6% (22/472)

Same category        flamingo / partridge           alligator / crocodile
                     turtle / tortoise              sailboat / boat
                     pumpkin / mandarin             asparagus / spinach

Different category   helicopter / shotgun           bowl / dish
                     barn / cabinet                 emerald / parsley
                     whale / bison                  thermometer / marble

Table 1: Proportion and examples of cases where subjects significantly preferred the dreamed concept
image (left) or the confounder (right). Pairs of images depict dreamed and confounder concepts.
Bold marks the dreamed concept that subjects were asked to pick the image for. E.g., subjects were
presented with the bowl and dish images, were asked to decide which one contains a bowl, and
preferred the image of the dish.


Gold \ Predicted   MAN-MADE   ORGANIC   ANIMAL   Pref.   No Pref.   Total
MAN-MADE                128         9        5     142        124     266
ORGANIC                   0        48        1      49         19      68
ANIMAL                    9        14       33      56         72     128

Table 2: Confusion matrix for Experiment 3: rows report gold categories, columns report subjects’
responses (the first 3 columns count cases with significant preferences for one macro-category only).


4.3 Experiment 3: Judging macro-categories of objects 

The previous experiments have shown that our language-driven image generation system visualizes 
properties that are salient and relevant enough to distinguish unrelated concepts (Experiment 1) but 
not closely related ones (Experiment 2). The last experiment takes high-level category structure 
explicitly into account in the design. 

Experiment description We group the McRae categories into three macro-categories, namely
ANIMAL vs. ORGANIC vs. MAN-MADE, that are widely recognized in cognitive science as fundamental
and unambiguous [11]. Participants are given a generated image and are asked to pick the
macro-category that best describes the object in it.

Results Again, the number of images for which participants’ preferences are not significant is
high: 28% of the ORGANIC images, 47% of the MAN-MADE images and 56% of the ANIMAL
images. However, when participants do show a significant preference, in the large majority of cases
it is in favor of the correct macro-category: this is so for 98% of the ORGANIC images (70.5% of
total), 90% of the MAN-MADE images (48% of total), and 59% of the ANIMAL ones (25.7% of total).
Table 2 reports the confusion matrix across the macro-categories. Confusions arise where one would
expect them: both MAN-MADE and ANIMAL images are more often confused with ORGANIC things
than with each other.
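
The percentages above follow directly from the counts in Table 2. As a worked check, the short sketch below recomputes them; the dictionary layout and the function name are illustrative choices, and the results agree with the reported figures up to rounding.

```python
# Counts copied from Table 2: significant preferences per predicted
# macro-category, plus the number of images with no significant preference.
confusion = {
    "MAN-MADE": {"MAN-MADE": 128, "ORGANIC": 9, "ANIMAL": 5, "no_pref": 124},
    "ORGANIC": {"MAN-MADE": 0, "ORGANIC": 48, "ANIMAL": 1, "no_pref": 19},
    "ANIMAL": {"MAN-MADE": 9, "ORGANIC": 14, "ANIMAL": 33, "no_pref": 72},
}


def summarize(gold, row):
    """Percentages for one gold macro-category (one row of the matrix)."""
    pref = sum(n for label, n in row.items() if label != "no_pref")
    total = pref + row["no_pref"]
    return {
        "no_pref_pct": 100 * row["no_pref"] / total,
        "correct_given_pref_pct": 100 * row[gold] / pref,
        "correct_of_total_pct": 100 * row[gold] / total,
    }


summary = {gold: summarize(gold, row) for gold, row in confusion.items()}
for gold, stats in summary.items():
    print(gold, {name: round(value, 1) for name, value in stats.items()})
```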

Again, color (either of the object itself or of the environment) is the leading property distinguishing
objects among the three macro-categories. As Figure 1 shows, orange, green and a darker mixture of
colors characterize ORGANIC things, ANIMALS, and MAN-MADE objects respectively. Images that
do not typically have these colors are harder to recognize. For instance, the few mistakes for
ORGANIC images belong to the natural object category (e.g., rocks); all the other categories within
this macro-category are in the vast majority of the cases judged correctly. In the MAN-MADE macro-category
(Figure 2, left), the images of buildings are the most easily recognizable; as one can see in
Figure 1, those images share the same pattern: two horizontal layers (land/dark and sky/blue) with a
vertical structure cutting across them (the building itself). Similarly, vehicles display two layers with
a small horizontal structure crossing them, and they are almost always correctly classified. Finally,
within the ANIMAL macro-category (Figure 2, right), birds and fish are more often misclassified than
other animals, with their typical environment probably playing a role in the confusion.

Figure 2: Distribution of macro-category preferences across the gold concepts of the MAN-MADE
(left), ORGANIC (middle) and ANIMAL (right) categories.

5 Discussion 

We introduced the new task of generating pictures visualizing the semantic content of linguistic 
expressions as encoded in word embeddings, proposing more specifically a method we dubbed 
language-driven image generation. 

The current system seems capable of visualizing the typical color of object classes and aspects of their
characteristic environment. Interestingly, vector-based word representations are notoriously bad at
capturing color, and we do not expect them to be much better at characterizing environments, so
our results suggest that, already in its current form, our system could also be used to enrich word 
representations, by highlighting aspects of concepts that are not salient in language but are probably 
learned by similarity-based generalization from the cross-modal mapping training examples. In this 
sense, language-driven image generation is more than a simple word embedding evaluation tool. At 
the same time, our system completely ignores visual properties related to shape. Shapes are not often 
expressed by linguistic means (although we all recognize the typical “gestalt” of, say, a mammal, 
it is very difficult to describe it in words), but in the same way in which we can capture color and 
environment, better visual representations or feature inversion methods might lead us in the future 
to associate, by means of images, typical shapes to shape-blind linguistic representations. 

Currently we approach language-based image generation as a two-step process. Inspired by recent
work in caption generation that conditions word production on visual vectors, we plan to explore an
end-to-end model that conditions the generation process on information encoded in the word embeddings
of the word/phrase that we wish to produce an image for, building upon classic generative
models of image generation.
A Answer Keys to Figure 1


We provide the concept names of the word embeddings used to generate the images of Figure 1 (we
reproduce Figure 1 in this document as Figure 3 for the reader's convenience). Due to lack of space, we split the
concept names into 3 tables, Tables 3-5, which provide the concept names of the word
embeddings used to generate the MAN-MADE, ORGANIC and ANIMAL images respectively.


Figure 3: Generated images of 10 concepts per category for 20 basic categories (columns 1-20), grouped by
macro-category (Man-made, Organic, Animals).
   1           2        3         4          5         6        7         8         9           10         11
A  dishwasher  ashtray  bed       accordion  apron     anchor   axe       balloon   airplane    barn       apartment
B  freezer     bag      bench     bagpipe    armour    banner   baton     ball      ambulance   building   basement
C  fridge      barrel   bookcase  banjo      belt      blender  bayonet   doll      bike        bungalow   bedroom
D  microwave   basket   bureau    cello      blouse    bolts    bazooka   football  boat        cabin      bridge
E  oven        bathtub  cabinet   clarinet   boots     book     bomb      kite      buggy       cathedral  cellar
F  projector   bottle   cage      drum       bracelet  brick    bullet    marble    bus         chapel     elevator
G  radio       bowl     carpet    flute      buckle    broom    cannon    racquet   canoe       church     escalator
H  sink        box      catapult  guitar     camisole  brush    crossbow  rattle    cart        cottage    garage
I  stereo      bucket   chair     harmonica  cape      candle   dagger    skis      car         house      pier
J  stove       cup      sofa      harp       cloak     crayon   shotgun   toy       helicopter  hut        bridge

Table 3: Concept names of word embeddings used to generate MAN-MADE images.



   12       13          14       15
A  biscuit  apple       beehive  birch
B  bread    asparagus   bouquet  cedar
C  cake     avocado     emerald  dandelion
D  cheese   banana      muzzle   oak
E  pickle   beans       pearl    pine
F  pie      beets       rock     prune
G  raisin   blueberry   seaweed  vine
H  rice     broccoli    shell    willow
I  cake     cabbage     stone    birch
J  biscuit  cantaloupe  muzzle   pine

Table 4: Concept names of word embeddings used to generate ORGANIC images.

   16         17        18           19           20
A  blackbird  whale     grasshopper  alligator    bear
B  bluejay    octopus   hornet       crocodile    beaver
C  budgie     clam      moth         frog         bison
D  buzzard    cod       snail        iguana       buffalo
E  canary     crab      ant          python       bull
F  chickadee  dolphin   beetle       rattlesnake  calf
G  flamingo   eel       butterfly    salamander   camel
H  partridge  goldfish  caterpillar  toad         caribou
I  dove       guppy     cockroach    tortoise     cat
J  duck       mackerel  flea         cheetah      cheetah

Table 5: Concept names of word embeddings used to generate ANIMAL images.



